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construction is extended to multivariate time series, with an iterative selection of variables 
and time delays. A case study of numerically generated solutions of the x— and z coordinates 
of the Lorenz system, and an application to heart rate and respiration data, are used for illustration. 
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Phase space reconstruction by time delay embedding is 
at the center of most nonlinear time series analysis meth- 
ods (see 0, 0] for an introduction). It is a primal goal 
as it ensures, under certain generic conditions, the recon- 
struction of a phase space equivalent to the original one, 
thus allowing a qualitative and quantitative analysis of 
the underlying dynamical system. The most commonly 
used embedding techniques are based on Takens embed- 
ding theorem JJJ , which only considers delay-coordinate 
maps built from a single observable, that is, a scalar 
time series 0, Even though the widespread use 

and paramount importance of this embedding theorem, 
it can be extremely difficult to reconstruct a phase space 
from a scalar time series when more than a couple degrees 
of freedom are active 0], which is the common scenario 
when analyzing biological systems, due to the complexity 
of their structures and complicated dynamics. The moti- 
vation for this study is the increasing interest in reverse- 
engineering biological systems directly from time series 
(see [l^ for example), whose typical multivariate, finite 
and noisy nature renders it particularly important to de- 
velop efficient multivariate embedding schemes 0|. 

Generically consider a smooth deterministic dynamical 
system s{t) = f{s{tQ)), either in continuous or discrete 
time, whose trajectories are asymptotic to a compact 
d-dimensional manifold. By performing fc-dimensional 
measurements on the system, where k — 1, . . . , c?, it is 
possible to define a function X(i) = h[s(t = ix 6)] that re- 
lates the states of the dynamical system throughout time 
with a time series of measured points, where X(j) G M*^, 
i = 1, . . . ,n; n is the total number of sampled points, 
and S is the sampling time. Phase space reconstruction 
by time delay embedding is a method of generating an m- 
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dimensional manifold, from the (n x k) available measure- 
ments, that is equivalent to the original d-dimensional 
manifold. In the scalar scenario, that is, for fc = 1, 
an m-dimensional embedding matrix of delay-coordinate 
column vectors can be defined from the initial time se- 
ries X(j-), as X = [X(j),X(j_|_,-j),X(j_|_.r2)j • ■ • >^(i-|-T(„_i))]- 

Building such a matrix implies estimating two param- 
eters: the time delay r, which is the time displacement 
between successive columns, and the embedding dimen- 
sion TO, which is the dimension, or number of columns, 
of the final embedding matrix. We have recently pro- 
posed (3 a nearest neighbor measure N for time delay 
embedding, solely based on topological and dynamical 
arguments documented by the data. This measure pos- 
sesses the useful feature of retaining the inverse relation- 
ship with structure disclosure, meaning that it first de- 
creases with T, and then returns to higher values when r 
is too long for dynamic coupling to be retained. When 
the time series is noise-free, such t value corresponds to 
the global minimum of TV and an upper limit to an effi- 
cient selection of time delays, beyond which statistical in- 
dependence reflects dynamic decoupling. Furthermore, it 
was shown that using different time delays for consec- 
utive embedding dimensions is more efficient than using 
the same r value for all dimensions, which has been the 
common approach to phase space reconstruction by time 
delay embedding. Hence, the N algorithm will output a 
vector of different time delays [ri,T2, . . . ,T(^m-i)]i as in- 
corporated in the definition of embedding matrix above. 
In this report we extend that nearest neighbor embed- 
ding with different time delays method to multivariate 
time series, that is, when 1 < fc < c?, by selecting, at each 
iteration, the combination of variable, from the initial 
set of k variables, and time delay that first minimizes N. 
As before the false nearest neighbors (F) algorithm 
proposed by Kennel et al. is used to set the final em- 
bedding dimension. This algorithm considers the ratio of 
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Euclidean distances between each and every point and its 
nearest neighbor, first on a m-dimensional and then on 
a (m + l)-dimensional space. If the ratio is greater than 
a given threshold, these points are referred to as false 
nearest neighbors, that is, points that appear to be near- 
est neighbors not because of the dynamics, but because 
the attractor is being viewed in an embedding space too 
small to unfold it. When the fraction of F as a func- 
tion of the embedding dimension decreases to zero, the 
underlying attractor is unfolded and m can be optimally 
estimated. 

The nearest neighbor embedding with different time 
delays algorithm for multivariate time series is detailed 
below, (i) Consider an initial multivariate time series 
X[ixfc]; * = of n measurements of k different 

dynamical variables. For each variable and for each r 
being tested, build a candidate embedding matrix T: 

for i from 1 to A: { 

for T from 1 to { define T = [yi[i.i:k),'^(i+T.j)\ } 

} 

(the upper limit for t is chosen arbitrarily), (ii) For each 
[k + l)-dimensional point, that is, for each row in ma- 
trix T, estimate its l)-dimensional nearest neighbor. 
Calculate the Euclidean distance between them (Iei- (hi) 
Consider both points one sampling unit ahead, still in 
(A: -|- l)-dimensions, and calculate the new Euclidean dis- 
tance between them dE2- (iv) Estimate the ratio dE2/dEi 
and save the number of distance ratios larger than 10. 
That fraction is what is referred to as N in the r select- 
ing profiles ahead. The threshold value, though heuris- 
tically set, is justified by numerical studies and has 
low parametric sensitivity, (v) From the profiles of A'^ 
vs T for each of the k variables, select the combination 
of variable and time delay that first minimizes N. That 
should be the optimal variable ji and time delay ti se- 
lection for this first embedding cycle (that is, each it- 
eration that adds another dimension to the candidate 
embedding matrix), (vi) Estimate the percentage of F 
and save that value as a function of the dimensional- 
ity of the candidate embedding matrix T. (vii) Build 
a putative embedding matrix X ~ [xj-j ^^.j.), xjj^^^ j^-j], 
where 1 < ji < fc. (viii) Again, for each variable and r 
being tested, build a candidate embedding matrix, now 
T = [x(^,l■.k),y^{i+T^.,j^),^i^+T,J)]■ (ix) Repeat steps (ii) to 
(viii), considering that now points are {k + 2)- and more 
dimensional, (x) Stop this iterative procedure when the 
fraction of F [step (vi)] has dropped to 0. The outcome of 
running this procedure for as long as necessary to min- 
imize F is the final m-dimcnsional embedding matrix: 

where r e N and 1 < j < k. 

Two bivariate data sets will be used to illustrate the 
multivariate extension of the nearest neighbor embed- 
ding with different time delays. The first are the x— 
and z coordinates [L{X) and L{Z), respectively] of the 
Lorenz system of differential equations [32; = a(y — x), 
y = x{p — z) ~ y, z = xy — Pz, with parameters 



a = 10, p = 28, /3 = 8/3. The equations were numer- 
ically integrated with a 4-5th order Runge-Kutta algo- 
rithm, sampled at S = 0.01 intervals, and transients were 
removed. The second data set is composed by two phys- 
iological signals, the heart rate [P{H)] and respiration 
[P{R)] from a 49-year-old man diagnosed with sleep ap- 
nea, a potentially life-threatening disorder in which the 
subject stops breathing during sleep. The data was ex- 
tracted from data set B of the 1991 Santa Fe Time Series 
Prediction and Analysis Competition |^. The variables 
were digitized at 250 Hz and then sampled at 0.5 second 
intervals. The units of the P{H) measurements are beats 
per minute, while P{R) is provided in uncalibrated dig- 
itization units (see Q for more detailed information on 
the data set and its pre-processing). When considering 
multivariate time series, normalization is a pivotal pre- 
requisite to overcome scale shifts. Accordingly, we have 
non-parametrically normalized each variable separately 
to its empirical cumulative distribution, by first sorting 
all n values and then replacing them by their rank/n. 
Each data set, as used in the subsequent analysis, in- 
cludes a total of 8000 points, part of which is plotted in 
Fig. 1. The section of data set B used in this report 
includes both a period of apnea and a period of intermit- 
tent apnea. 
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FIG. 1: Data sets. Upper panel: the x [L{X), left] and z 
[L{Z), right] coordinates of the Lorenz system. Lower panel: 
heart rate [P{H), left] and respiration [P{R), right] signals, in 
beats per minute and uncalibrated digitization units, respec- 
tively. In the subsequent analysis, each variable is normalized 
to its empirical cumulative distribution. 

First, consider the case study of two coordinates of 
the Lorenz system. As this is a low-dimensional sys- 
tem, there is no obvious advantage in using multivariate 
time series to reconstruct the phase space. Therefore, 
this example is used only to illustrate the multivariate 
procedure for a well-described system, where the prob- 
lems of noise and nonstationarity, typically encountered 
in biological data sets, are absent. The N profiles for 
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selecting r for the first embedding cycle are displayed 
in Fig. 2(a), where the thick line indicates the pro- 
file for L{X), meaning that the candidate matrix being 
tested is L(Z)(j), L(X)(j_|.^)], and the thin hnc 

indicates the profile for L(Z), meaning that the candi- 
date matrix being tested is 

The variable L{X) and the r value of its N profile first 
minimum ti are the optimal combination that is se- 
lected from this first embedding cycle [Fig. 2(a), cir- 
cle], which implies that the candidate embedding matri- 
ces that would be tested in a second embedding cycle 
would be [L(X)(j),L(Z)(j),L(X)(i+^j),L(X)(i+^)] and 
[L{X)(^i^, L{Z)(^i^, L{X)(,+^^), L{Z),^,+^)]. The rationale 
for using the first minimum was discussed in [2j, where 
it was shown to be the most efficient choice. Displayed 
in Fig. 2(b) is the fraction of F as a function of m 0, 
from which can be concluded that the optimal embedding 
dimension is m = 3, and the final embedding matrix is 
then [i(X)(,),L(Z)(,),L(X)(,+,^)]. 




FIG. 2: (a) First embedding cycle profiles for r selection from 
N. The thick line indicates the profile for L{X) and the thin 
line the profile for L{Z). A circle indicates the optimal com- 
bination of variable and time delay that is selected from this 
first cycle. The global minima of A*' corresponds to the onset 
of dynamic decoupling, (b) The fraction of F as a function 
of m for the embedding of the bivariate L{X) and L{Z) time 
series. See the text for a more comprehensive description. 

Consider now the biological data set. Physiologi- 
cal systems are typically high-dimensional, nonstation- 
ary and contaminated by noise. We have taken no ac- 
tion to correct these problems, in the sense that we 
have used the data as given for the Santa Fe Com- 
petition. The T selecting profiles for the first, sec- 
ond, third and fourth embedding cycles are displayed 
in Figs. 3(a) to 3(d), respectively, where thick lines 
indicate the N profiles for P{H) and thin lines indi- 
cate the profiles for P{R)- The candidate embedding 
matrices are built in the same way as described in the 
Lorenz case study, that is, [P(-ff)(i), P(i?)(i), P(iJ)(i+T-)] 
and [P(_ff)(i), P(i?)(i), P(i?)(i4.7.)] are the candidate em- 
bedding matrices being tested for the first embedding 
cycle [Fig. 3(a)], and for the next embedding cycles, the 
candidate matrices are built based on the combination 
of variable and time delay selected in the previous cycle 
[Figs. 3(a) to 3(d), arrow]. As these are noisy profiles 
(see Q for a comparison with profiles for L{X) contam- 
inated with additive Gaussian white noise), the identi- 



fication of local minima becomes more difficult and is 
sometimes balanced with the identification of a point at 
which a change in the decaying velocity of the profile 
occurs. From the second to the fourth embedding cy- 
cles [Figs. 3(b) to 3(d)], a peak is visible at previously 
selected r values, a feature also present in the noisy pro- 
files reported in 13], which indicates that selecting the 
same r value would not only be a suboptimal choice, as 
it would indeed be the worst possible choice. Displayed in 
Fig. 3(e) is the fraction of F as a function of m 0, from 
which can be concluded that the optimal embedding di- 
mension is m = 6, and the final embedding matrix is then 

[P(i?)(,) , P(i?)(,) , P(i/) + P(il) + P(i?)(,+,3) , 

P(P)(i+r^)]. The embedding of variables with different 
oscillatory frequencies, such as the heart rate P{H) and 
respiration P{R), will initially be biased towards the vari- 
able with the higher frequency. This is clearly visible in 
Fig. 3(a), with P{H) presenting a substantially lower N 
profile than that of P{R)- However, this initial selection 
of P{H) is increasingly less advantageous, until the alter- 
native variable P{R) is favored for the efficiency of the 
embedding. 

The multivariate phase space reconstruction scheme 
could have been conceived in different ways, from embed- 
ding each variable separately and then adding them to- 
gether, to the proposed iterative selection from an initial 
set of variables. The latter is more efficient, in the sense 
that the final embedding dimension is smaller then when 
variables are embedded separately. Such advantage is 
particularly relevant for massively multivariate systems, 
as for example proteomics or transcriptomic time series, 
which include hundreds or even thousands of variables 
. Many of these variables will likely have a very strong 
correlation among themselves. In that case, the most 
efficient phase space reconstruction does not necessarily 
start with the concatenation of all variables without de- 
lay, as the approach suggested in this report. Instead, it 
should start with a single variable, to which additional 
variables, first without any delay, are then added. This 
small variation to the proposed algorithm addresses the 
issue of sufficient representation in multivariate systems 
with tight correlation between parameters. It is inter- 
esting to note that the proposed implementation in fact 
treats each variable as a surrogate for a delayed repre- 
sentation of the other variables. This is also particularly 
well suited for the representation of dynamic behavior 
documented by molecular biology time series for a very 
pragmatical reason - they tend to be very short, in the 
sense that the number of time points is typically many 
fold smaller then the number of variables. 

The reverse engineering of biological processes from 
the time series they generate is often approached by 
parametrization of an explicit mathematical formulation 
[l^ . It can be argued that this approach is hampered by 
the lack of exploratory tools that analyze the dynamic be- 
havior directly to assist in selecting the most explanatory 
independent variables. Furthermore, the characteristic 
heterogeneity in oscillatory frequencies, large number of 
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FIG. 3: (a) First, (b) second, (c) third and (d) fourth embedding cycle profiles for r selection from A''. Thick lines indicate 
profiles for P{H) and thin lines indicate profiles for P{R)- An arrow indicates the optimal combination of variable and time 
delay selected from each embedding cycle. From the second (b) to the fourth (d) embedding cycles, a peak is visible at 
previously selected r values, (e) The fraction of F as a function of m for the embedding of the bivariate P{H) and P(R) time 
series. See the text for a more comprehensive description. 



variables and short duration of the time series creates 
a particular challenge for approaching this exploratory 
analysis through the characterization of a reconstructed 
attractor. Accordingly, this report describes an attempt 
to use the criteria of efficient embedding to achieve that 
goal. 
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